Back to the index page.

Intro

Now that we have developed several ways to create categorical palettes, we can appreciate how many times different clusterings on different colorspaces create similar colors. If the colors are too similar, the palette should be modified (e.g. rotating, or choosing a different colorspace), but if they are similar but still distinguishable, they can still be used. However, in a 2D map such as a tSNE plot, similar clusters of dots may be assigned similar colors, making their interpretation more difficult. The issue is a “map colouring” issue, with the added constraint of using every colour once. I have attempted at giving a solution combining two closely related fields: graph theory and computational geometry.

Colouring a tSNE with separate colors

The solution that I propose is not at all elegant, but seems to do the job. It consists of several steps:
  1. Find cluster centroids
  2. Connect them by Delauney triangulation, obtaining an undirected graph
  3. Calculate distances on Delauney triangulation edges only
  4. Input the graph with the distances as weights into a Bellman-Ford shortest path algorithm in order to find the most connected centroid, i.e. the centroid with most points in its vicinity
  5. Start from the most connected centroid and order all the other points in terms of Euclidean distance from this centroid, assign it a the first color of the palette
  6. Order the other colors by DeltaE distance to the first color on the CIE 1976 space, which is the “perceivable” difference
  7. Assign most distinct colors to points closest to the most connected centroid
  8. Check pairwise distance on colors that share a Voronoi edge
  9. Back to 7), using the second color
  10. Take the starting color that has the maximum minimum distance between all shared Voronoi edges


I will illustrate the procedure step by step.

1: Find cluster centroids in the tSNE

This was done in the palette tools page, and is quite simple. The number of clusters is for convenience the number of colors in the palette. We will use the CLARA RGB palette for this, which has 8 colors. We label each centroid on the plot with the corresponding cluster number.

getCenter <- function(o)
{
    xy = 1:2
    xy[1] = mean(o[,1])
    xy[2] = mean(o[,2])
    return(xy)
}

tmp = dist(tsne)
    hc = hclust(tmp)
    cutt = cutree(hc, k = 8)
    dat = as.data.frame(cbind(tsne,cutt))   #Add the clustering assignment to the tSNE coordinates

    plot(dat[,1], dat[,2], col = "gray", pch = 16, xlab = NA, ylab = NA)
    centers = matrix(0, nrow = 8, ncol = 2)
    for(i in unique(cutt)) centers[i,] = getCenter(tsne[cutt == i,])
    points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
    text(centers[,1], centers[,2], labels = 1:8, font = 2)

claram = cluster::clara(mm[,1:3], k = 8) #The syntax of clara is the same as k-means
    claram = claram$medoids #Take medoids from the clara object
    clcols = vector()
    for(i in 1:8)   
        {
            clcols[i] = mm[which(mm[,3] == claram[i,1] & mm[,1] == claram[i,2] & mm[,2] == claram[i,3]),4]
        }

2: Connect centroids by Delauney triangulation

The Delauney triangulation of a set of points connects points in such a way that no point in this set is inside the circumcircle of any triangle of the triangulation. Another interesting property is that it contains the Euclidean minimum spanning tree of the set of points. Thus the Delauney triangulation of a set of points won’t connect all of them to each other, but will create enough edges that a shortest path to visit all of them can be found, and points close in space will be connected more likely than those farther away. In practical terms it is a way of reducing the set of distances to be calculated, so that the computation of the shortest path can be quick and find easily the most connected centerpoints.

vtess <- deldir(centers[,1], centers[,2], rw = c(range(dat[,1]), range(dat[,2])))
    plot(dat[,1], dat[,2], col = "gray", pch = 16, xlab = NA, ylab = NA)
    plot(vtess, wlines="triang", wpoints="dummy", number=FALSE, add=TRUE, lty=1, col = "black")
    points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
    text(centers[,1], centers[,2], labels = 1:8, font = 2)

We then calculate pairwise distances among all points, but only retain those in the nodes connected by the Delauney triangulation. Then we create a matrix that is suitable for shortest path calculation.

dat.tmp = centers
dist.tmp = dist(dat.tmp)

tri <- vtess$delsgs
tri$dist = tri$ind1
for(i in 1:nrow(tri)) 
    {   
        tri$dist[i] = as.matrix(dist.tmp)[tri$ind1[i], tri$ind2[i]]
    }
arcs = tri[,5:7]
colnames(arcs) = c("source", "target", "weight")
arcs = as.matrix(arcs)

3: Compute shortest path from each node, choose the best one

Now we apply the Bellman-Ford shortest path algorithm on the Delauney graph, identifying the shortest path to all the other nodes starting from each node. The weights are the Euclidean distances between the edges.

library(optrees)
BF_dists = list()
dists = vector()
        for(k in 1:max(unique(cutt))) 
            {
                BF_dists[[k]] = getShortestPathTree(1:max(unique(cutt)), arcs, source.node = k, algorithm = "Bellman-Ford", directed = F, show.data = F, show.graph = F, show.distances = F)
                dists[k] = sum(BF_dists[[k]]$distances) 
            }

The dists vector contains the sum of distances walked in the calculation of the shortest path for each node, so we want to take the node with the smallest distance walked:

initial = which(dists == min(dists))

4: Start with a random color, then add most distant color to closest nodes

We will begin from centroid number 6, using a random color from our palette:

set.seed(13)
colordf = data.frame(1:8, rep("gray", 8), stringsAsFactors =  F)
colnames(colordf) = c("cluster", "color")
colordf[initial, 2] = sample(clcols, 1)
plot(dat[,1], dat[,2], col = colordf[dat[,3],2], pch = 16, xlab = NA, ylab = NA)

We then need to go by closest to most distant centroid:

nextcols = as.numeric(names(as.matrix(dist.tmp)[initial,][order(as.matrix(dist.tmp)[initial,])][2:ncol(as.matrix(dist.tmp))]))
nextcols
## [1] 3 4 2 1 7 8 5

And to these centroids we assign colors in order of \(\Delta\)E, the “perceived difference” in colors. Two colors with a \(\Delta\)E below 2.3 are impossible to distinguish by the human eye.

dist_to_initial = vector()
for(i in 1:length(clcols)) dist_to_initial[i] = colorscience::deltaE1976(grDevices::convertColor(hex2RGB(colordf[initial, 2])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(clcols[i])@coords, from = "sRGB", to = "Lab")
)
    

    names(dist_to_initial) = clcols
dist_to_initial = dist_to_initial[order(dist_to_initial, decreasing = T)]
dist_to_initial = dist_to_initial[1:(length(dist_to_initial)-1)]
colordf[nextcols,2] = names(dist_to_initial)

We can now colour our tSNE plot with a qualitative palette in order to highlight clusters that are close on the plot:

plot(dat[,1], dat[,2], col = colordf[dat[,3],2], pch = 16, xlab = NA, ylab = NA)

5 Check how the assignment performed using Voronoi diagrams

Although distances from the starting centroid are obviously calculated on two dimensions, when we sort them we are actually flattening them in one dimension. This means that even though two centroids (e.g. centroids 2 and 7) are not immediately one after the other in terms of distance from centroid 6, they are close together in space. We want to see how distinctive this color assignment is in terms of color distance. Since we already computed the Delauney triangulation for this set of points, we can join the centers of the circumcircles in which each Delauney triangle is inscribed, thus obtaining the Voronoi diagram for the centroids. We can see that there is a good agreement between the Voronoi cells and the actual clusters:

vtiles = tile.list(vtess)
plot(vtiles, fill = colordf$color)
points(dat[,1], dat[,2], col = "black", bg =  colordf[dat[,3],2], pch = 21, xlab = NA, ylab = NA, cex = 0.7)

points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
text(centers[,1], centers[,2], labels = 1:8, font = 2)

If two Voronoi tessels share an edge, they are obviously adjacent in space, and as we saw earlier it is a good approximation for clusters being adjacent to each other. To find shared edges we will use the spatstat package, and a function found on this answer on Stack Overflow.

library(spatstat)

cps <- ppp(centers[,1], centers[,2],window=owin(range(dat[,1]), range(dat[,2])))

sharededge <- function(X) {
   verifyclass(X, "ppp")
   Y <- X[as.rectangle(X)]
   dX <- deldir(Y)
   DS <- dX$dirsgs
   xyxy <- DS[,1:4]
   names(xyxy) <- c("x0","y0","x1","y1")
   sX <- as.psp(xyxy,window=dX$rw)
   marks(sX) <- 1:nobjects(sX)
   sX <- sX[as.owin(X)]
   tX <- tapply(lengths.psp(sX), marks(sX), sum)
   jj <- as.integer(names(tX))
   ans <- data.frame(ind1=DS[jj,5], 
                     ind2=DS[jj,6], 
                     leng=as.numeric(tX))
   return(ans)
}

shared_edge_lengths <- sharededge(cps)

We now check which colors are within each shared edge, and their \(\Delta\)E distances:

edgecolors = shared_edge_lengths
edgecolors$col1 = colordf[edgecolors$ind1,2]
edgecolors$col2 = colordf[edgecolors$ind2,2]
edgecolors$DeltaE.dist = apply(edgecolors, 1, 
    function (x) colorscience::deltaE1976(grDevices::convertColor(hex2RGB(x[4])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(x[5])@coords, from = "sRGB", to = "Lab")))

By plotting all \(\Delta\)E distances we see which two adjacent tessels have the lowest \(\Delta\)E. For this palette and with this combination the distance values are fairly good (always above 50), so that any assignment is quite distinguishable.

plot(1:nrow(edgecolors), rep(0, nrow(edgecolors)), cex = 0, ylim = c(-1, max(edgecolors$DeltaE.dist)+10), xaxt = "n", xlab = NA, ylab = "DeltaE color distance")

segments(x0 = 1:nrow(edgecolors), y0 = rep(0, nrow(edgecolors)), x1 = 1:nrow(edgecolors), y1 =edgecolors$DeltaE.dist)
points(1:nrow(edgecolors), rep(0, nrow(edgecolors)), cex = 4, pch = 15, col = edgecolors$col1)

points(1:nrow(edgecolors), edgecolors$DeltaE.dist, cex = 4, pch = 15, col = edgecolors$col2)
#bpl = barplot(edgecolors$DeltaE.dist, plot = F)
axis(1, at = 1:nrow(edgecolors), labels = paste(edgecolors$ind1, edgecolors$ind2, sep = "-"), las = 2)

We can also check all possible distances to have a sense of how much better this procedure was:

distpairs = expand.grid(1:8, 1:8)
distpairs$Col1 = colordf[distpairs$Var1,2]
distpairs$Col2 = colordf[distpairs$Var2,2]
distpairs$DeltaE.dist = apply(distpairs[,1:2], 1, 
    function (x) colorscience::deltaE1976(grDevices::convertColor(hex2RGB(colordf[x[1],2])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(colordf[x[2],2])@coords, from = "sRGB", to = "Lab")))

plot(1:nrow(distpairs), rep(0, nrow(distpairs)), cex = 0, ylim = c(-1, max(distpairs$DeltaE.dist)+10), xaxt = "n", xlab = NA, ylab = "DeltaE color distance")

segments(x0 = 1:nrow(distpairs), y0 = rep(0, nrow(distpairs)), x1 = 1:nrow(distpairs), y1 =distpairs$DeltaE.dist)
points(1:nrow(distpairs), rep(0, nrow(distpairs)), cex = 2, pch = 15, col = distpairs$Col1)

points(1:nrow(distpairs), distpairs$DeltaE.dist, cex = 2, pch = 15, col = distpairs$Col2)
#bpl = barplot(edgecolors$DeltaE.dist, plot = F)
axis(1, at = 1:nrow(distpairs), labels = paste(distpairs$Var1, distpairs$Var2, sep = "-"), las = 2)

## Cycle through all the colors to find the optimal starting point

It is important to note that had we started with a different color in the most connected centroid, we would have generated a different order of colours, maybe maximizing the distances between shared edges. So the next logical step in this procedure is to generate an assignment for every color in the starting point, look at the distribution of color distances, and choose the assignment with the highest number of large distances:

colorassignments = list()
listcolordf = list()
for(j in 1:length(clcols))
{   
    colordf = data.frame(1:length(clcols), rep("gray", length(clcols)), stringsAsFactors =  F)
    colordf[initial, 2] = clcols[j]
    dist_to_initial = vector()
    for(i in 1:length(clcols)) 
        {
            dist_to_initial[i] = colorscience::deltaE1976(grDevices::convertColor(hex2RGB(colordf[initial, 2])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(clcols[i])@coords, from = "sRGB", to = "Lab"))
        }
    names(dist_to_initial) = clcols
    dist_to_initial = dist_to_initial[order(dist_to_initial, decreasing = T)]
    dist_to_initial = dist_to_initial[1:(length(dist_to_initial)-1)]
    colordf[nextcols,2] = names(dist_to_initial)
    edgecolors = shared_edge_lengths
    edgecolors$col1 = colordf[edgecolors$ind1,2]
    edgecolors$col2 = colordf[edgecolors$ind2,2]
    edgecolors$DeltaE.dist = apply(edgecolors, 1, 
    function (x) colorscience::deltaE1976(grDevices::convertColor(hex2RGB(x[4])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(x[5])@coords, from = "sRGB", to = "Lab")))
    colorassignments[[j]] = edgecolors
    listcolordf[[j]] = colordf
}
mindists = unlist(lapply(colorassignments, function (x) min(x$DeltaE.dist[x$DeltaE.dist > 0])))

best.assignment = which(mindists == max(mindists))
if(length(best.assignment) > 1) best.assignment = best.assignment[1]

best.colordf = listcolordf[[best.assignment]]

plot(dat[,1], dat[,2], col = best.colordf[dat[,3],2], pch = 16, xlab = NA, ylab = NA)
points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
text(centers[,1], centers[,2], labels = 1:length(clcols), font = 2)

And their distances:

plot(1:nrow(colorassignments[[best.assignment]]), rep(0, nrow(colorassignments[[best.assignment]])), cex = 0, ylim = c(-1, max(colorassignments[[best.assignment]]$DeltaE.dist)+10), xaxt = "n", xlab = NA, ylab = "DeltaE color distance")
segments(x0 = 1:nrow(colorassignments[[best.assignment]]), y0 = rep(0, nrow(colorassignments[[best.assignment]])), x1 = 1:nrow(colorassignments[[best.assignment]]), y1 =colorassignments[[best.assignment]]$DeltaE.dist)
points(1:nrow(colorassignments[[best.assignment]]), rep(0, nrow(colorassignments[[best.assignment]])), cex = 4, pch = 15, col = colorassignments[[best.assignment]]$col1)

points(1:nrow(colorassignments[[best.assignment]]), colorassignments[[best.assignment]]$DeltaE.dist, cex = 4, pch = 15, col = colorassignments[[best.assignment]]$col2)
#bpl = barplot(edgecolors$DeltaE.dist, plot = F)
axis(1, at = 1:nrow(colorassignments[[best.assignment]]), labels = paste(colorassignments[[best.assignment]]$ind1, colorassignments[[best.assignment]]$ind2, sep = "-"), las = 2)
abline(h = 40, lty = 2, lwd = 0.4)

Now we package everything in a function:

colorTSNE <- function(tsnecoords, palette, output.type = "tsneplot")
{   
    ####This will be removed when the input is a tSNEplot with its own clustering!!!
    tmp = dist(tsnecoords)
    hc = hclust(tmp)
    cutt = cutree(hc, k = length(palette))
    dat = as.data.frame(cbind(tsnecoords,cutt)) #Add the clustering assignment to the tSNE coordinates
    #####
    
    centers = matrix(0, nrow = length(palette), ncol = 2)
    for(i in unique(cutt)) 
        {
            centers[i,] = getCenter(tsne[cutt == i,])
        }
    #First step: Delauney triangulation and Voronoi tessellation
    vtess <- deldir(centers[,1], centers[,2], rw = c(range(dat[,1]), range(dat[,2])))
    
    dist.tmp = dist(centers)

    tri <- vtess$delsgs
    tri$dist = tri$ind1
    for(i in 1:nrow(tri)) 
    {   
        tri$dist[i] = as.matrix(dist.tmp)[tri$ind1[i], tri$ind2[i]]
    }
    arcs = as.matrix(tri[,5:7])

    #Second step: Bellman-Ford shortest path calculation
    BF_dists = list()
    dists = vector()
        for(k in 1:max(unique(cutt))) 
            {
                BF_dists[[k]] = getShortestPathTree(1:max(unique(cutt)), arcs, source.node = k, algorithm = "Bellman-Ford", directed = F, show.data = F, show.graph = F, show.distances = F)
                dists[k] = sum(BF_dists[[k]]$distances) 
            }
    initial = which(dists == min(dists))

    #Third step: cycle through all the colors and check the pairwise distance between Voronoi cells with shared edges choosing the best starting color
    nextcols = as.numeric(names(as.matrix(dist.tmp)[initial,][order(as.matrix(dist.tmp)[initial,])][2:ncol(as.matrix(dist.tmp))]))
    cps <- ppp(centers[,1], centers[,2],window=owin(range(dat[,1]), range(dat[,2])))
    shared_edge_lengths <- sharededge(cps)
    colorassignments = list()
    listcolordf = list()
    for(j in 1:length(palette))
        {   
    colordf = data.frame(1:length(palette), rep("gray", length(palette)), stringsAsFactors =  F)
    colordf[initial, 2] = palette[j]
    dist_to_initial = vector()
        for(i in 1:length(palette)) 
            {
                dist_to_initial[i] = colorscience::deltaE1976(grDevices::convertColor(hex2RGB(colordf[initial, 2])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(palette[i])@coords, from = "sRGB", to = "Lab"))
            }
        names(dist_to_initial) = palette
        dist_to_initial = dist_to_initial[order(dist_to_initial, decreasing = T)]
        dist_to_initial = dist_to_initial[1:(length(dist_to_initial)-1)]
        colordf[nextcols,2] = names(dist_to_initial)
        edgecolors = shared_edge_lengths
        edgecolors$col1 = colordf[edgecolors$ind1,2]
        edgecolors$col2 = colordf[edgecolors$ind2,2]
        edgecolors$DeltaE.dist = apply(edgecolors, 1, 
        function (x) colorscience::deltaE1976(grDevices::convertColor(hex2RGB(x[4])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(x[5])@coords, from = "sRGB", to = "Lab")))
        colorassignments[[j]] = edgecolors
        listcolordf[[j]] = colordf
        }
    
    mindists = unlist(lapply(colorassignments, function (x) min(x$DeltaE.dist[x$DeltaE.dist > 0])))
    best.assignment = which(mindists == max(mindists))

    if(length(best.assignment) > 1) best.assignment = best.assignment[1]
    best.colordf = listcolordf[[best.assignment]]
    colnames(best.colordf) = c("cluster", "color")
    if(output.type == "tsneplot")
        {
        plot(dat[,1], dat[,2], col = best.colordf[dat[,3],2], pch = 16, xlab = NA, ylab = NA)
        points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
        text(centers[,1], centers[,2], labels = 1:length(palette), font = 2)
            
        } else
    if(output.type == "distanceplot")
        {
        plot(1:nrow(colorassignments[[best.assignment]]), rep(0, nrow(colorassignments[[best.assignment]])), cex = 0, ylim = c(-1, max(colorassignments[[best.assignment]]$DeltaE.dist)+10), xaxt = "n", xlab = NA, ylab = "DeltaE color distance")
        segments(x0 = 1:nrow(colorassignments[[best.assignment]]), y0 = rep(0, nrow(colorassignments[[best.assignment]])), x1 = 1:nrow(colorassignments[[best.assignment]]), y1 =colorassignments[[best.assignment]]$DeltaE.dist)
        points(1:nrow(colorassignments[[best.assignment]]), rep(0, nrow(colorassignments[[best.assignment]])), cex = 4, pch = 15, col = colorassignments[[best.assignment]]$col1)

        points(1:nrow(colorassignments[[best.assignment]]), colorassignments[[best.assignment]]$DeltaE.dist, cex = 4, pch = 15, col = colorassignments[[best.assignment]]$col2)
        axis(1, at = 1:nrow(colorassignments[[best.assignment]]), labels = paste(colorassignments[[best.assignment]]$ind1, colorassignments[[best.assignment]]$ind2, sep = "-"), las = 2)
        abline(h = 40, lty = 2, lwd = 0.4)
            
    } else
    
    if(output.type == "assignment")
        return(best.colordf)
}

Limit cases

Distributing a large palette

A random color assignment would have generated some color pairs that are slightly harder to distinguish. However, this difficulty increases when we increase the number of colors generated by clustering. Let’s see it with 15 colors:

claram = cluster::clara(mm[,1:3], k = 15) #The syntax of clara is the same as k-means
    claram = claram$medoids #Take medoids from the clara object
    clcols = vector()
    for(i in 1:15)  
        {
            clcols[i] = mm[which(mm[,3] == claram[i,1] & mm[,1] == claram[i,2] & mm[,2] == claram[i,3]),4]
        }

tmp = dist(tsne)
    hc = hclust(tmp)
    cutt = cutree(hc, k = 15)
    dat = as.data.frame(cbind(tsne,cutt))   #Add the clustering assignment to the tSNE coordinates

    plot(dat[,1], dat[,2], pch = 16, xlab = NA, ylab = NA, col = clcols[dat[,3]])
    centers = matrix(0, nrow = 15, ncol = 2)
    for(i in unique(cutt)) centers[i,] = getCenter(tsne[cutt == i,])
    points(centers[,1], centers[,2], cex = 3, pch = 21, bg = "white")
    text(centers[,1], centers[,2], labels = 1:length(clcols), font = 2)

We can see how many more instances of less distinguishable couples can occur:

colordf = data.frame(1:length(clcols), clcols, stringsAsFactors = F)
distpairs = expand.grid(1:length(clcols), 1:length(clcols))
distpairs$Col1 = colordf[distpairs$Var1,2]
distpairs$Col2 = colordf[distpairs$Var2,2]
distpairs$DeltaE.dist = apply(distpairs[,1:2], 1, 
    function (x) colorscience::deltaE1976(grDevices::convertColor(hex2RGB(colordf[x[1],2])@coords, from = "sRGB", to = "Lab"), grDevices::convertColor(hex2RGB(colordf[x[2],2])@coords, from = "sRGB", to = "Lab")))
plot(1:nrow(distpairs), rep(0, nrow(distpairs)), cex = 0, ylim = c(-1, max(distpairs$DeltaE.dist)+10), xaxt = "n", xlab = NA, ylab = "DeltaE color distance")

segments(x0 = 1:nrow(distpairs), y0 = rep(0, nrow(distpairs)), x1 = 1:nrow(distpairs), y1 =distpairs$DeltaE.dist)
points(1:nrow(distpairs), rep(0, nrow(distpairs)), cex = 0.5, pch = 15, col = distpairs$Col1)

points(1:nrow(distpairs), distpairs$DeltaE.dist, cex = 0.5, pch = 15, col = distpairs$Col2)
#bpl = barplot(edgecolors$DeltaE.dist, plot = F)
#axis(1, at = 1:nrow(distpairs), labels = paste(distpairs$Var1, distpairs$Var2, sep = "-"), las = 2)
abline(h = 40, lty = 2, lwd = 0.4)

Whereas using this procedure:

colorTSNE(tsne, clcols, output.type = "tsneplot")

The distances seem to be optimal:

colorTSNE(tsne, clcols, output.type = "distanceplot")

Using a quantitative palette

A quantitative palette such as those generated by the viridis package should not be used as a qualitative one, but it represents an interesting limit case due to the fact that colors are interpolated and will be similar by definition. We try using 10 colors:

colorTSNE(tsne, viridis(10, option = "B"), output.type = "tsneplot")

Distances are not optimal:

colorTSNE(tsne, viridis(10, option = "B"), output.type = "distanceplot")


Back to top
Back to the index page.